
Extend power analysis to all estimators + simulation-based MDE/sample size #208

Open
igerber wants to merge 11 commits into main from power-analysis

Conversation


@igerber igerber commented Mar 18, 2026

Summary

  • Add estimator registry mapping all 12 built-in estimators (excluding BaconDecomposition) to their appropriate DGP, fit kwargs, and result extraction profiles
  • Refactor simulate_power() from hardcoded 4-estimator if/elif chain to registry-based lookup supporting all estimators automatically
  • Add simulate_mde() — bisection search over effect sizes to find the minimum detectable effect for any estimator
  • Add simulate_sample_size() — bisection search over n_units to find the required sample size for any estimator
  • Add SimulationMDEResults and SimulationSampleSizeResults dataclasses with summary(), to_dict(), to_dataframe() methods
  • 34 new tests: registry unit tests, per-estimator coverage tests, MDE search tests, sample size search tests
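
Both search helpers described above are bisection searches over a monotone power curve. A minimal sketch of the simulate_mde() idea, using a hypothetical power_fn in place of the real per-evaluation Monte Carlo simulation:

```python
def bisect_mde(power_fn, target=0.8, lo=0.01, hi=2.0, tol=0.01):
    """Smallest effect size whose estimated power reaches `target`.

    Assumes power_fn(effect) is monotonically increasing in the effect
    size; in the real helper each call runs many simulated datasets.
    """
    if power_fn(hi) < target:
        raise ValueError("upper bound of the search range never reaches target power")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_fn(mid) >= target:
            hi = mid  # powered at mid: the MDE is at or below mid
        else:
            lo = mid  # underpowered at mid: the MDE is above mid
    return hi

# Toy linear power curve for illustration only; its true MDE is 0.8.
mde = bisect_mde(lambda effect: min(1.0, effect))
```

simulate_sample_size() follows the same pattern with an integer search over n_units.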

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no methodology changes, this extends power analysis infrastructure only
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_power.py — 34 new tests across 4 test classes (TestEstimatorRegistry, TestEstimatorCoverage, TestSimulateMDE, TestSimulateSampleSize)
  • All 76 tests pass (4 skipped: matplotlib, 2 deselected: slow-marked TROP/SDiD)
  • black, ruff check, and mypy all pass clean
  • Backtest / simulation / notebook evidence: N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⛔ Blocker

Highest unmitigated severity is P0.

Executive Summary

  • The new ContinuousDiD registry profile is methodologically wrong for power analysis: it varies the DGP slope, but simulate_power() treats that scalar as the estimator’s scalar truth, even though the estimator returns overall_att. With defaults, treatment_effect=0 is still a non-null design.
  • The new simulate_mde() / simulate_sample_size() explicit-range paths can return unchecked upper bounds and power_at_* = 0.0 without ever evaluating the endpoint, because the supplied bracket is neither validated nor fully evaluated.
  • The added tests are mostly smoke tests. They do not cover ContinuousDiD null calibration or the explicit effect_range / n_range branches, so both defects can merge undetected.
  • I did not find a mitigating Methodology Registry note or TODO.md entry for either issue.

Path to Approval

  1. Remove ContinuousDiD from the automatic power registry for this PR, or redesign that profile so the simulated DGP, extracted estimate, and reported true_effect all refer to the same documented scalar estimand.
  2. In simulate_mde() and simulate_sample_size(), always evaluate user-supplied bracket endpoints, validate that the target power is actually bracketed, and never default power_at_* to 0.0 for unevaluated points.
  3. Add regression tests for ContinuousDiD under a null effect and for explicit effect_range / n_range edge cases, including narrow and unbracketed ranges.

Methodology

  • Severity: P0. Impact: The new ContinuousDiD power profile does not simulate the same estimand it evaluates. _continuous_dgp_kwargs() maps treatment_effect into the DGP’s att_slope only, while the generator keeps att_intercept=1.0 by default and defines linear treatment effects as att_intercept + att_slope * d. But _extract_continuous() pulls result.overall_att, and simulate_power() stores the input scalar as true_effect for bias/coverage. The Methodology Registry defines ContinuousDiD targets as ATT(d), ATT^{glob}, and ACRT(d), not the slope parameter. So this profile is wrong in two ways: treatment_effect=0 is still non-null by default, and even with att_intercept=0, the reported truth is still not the returned estimand. Concrete fix: either drop ContinuousDiD from _get_registry() for now, or add a ContinuousDiD-specific power API that targets a documented scalar quantity and computes the truth from the generated dose-response, not from the input slope. Evidence: diff_diff/power.py:127, diff_diff/power.py:273, diff_diff/power.py:1538, diff_diff/prep_dgp.py:780, diff_diff/prep_dgp.py:892, docs/methodology/REGISTRY.md:459
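
The "compute the truth from the generated dose-response" fix can be sketched as follows. The linear-effect form att_intercept + att_slope * d follows the review's description; the function name and signature are illustrative, not the library's actual API:

```python
def true_overall_att(doses, att_intercept, att_slope):
    """Scalar truth matching an overall_att-style estimand.

    Averages the linear effect att_intercept + att_slope * d over the
    treated doses, instead of reporting the raw slope parameter.
    """
    treated = [d for d in doses if d > 0]
    return sum(att_intercept + att_slope * d for d in treated) / len(treated)

# With the reported default att_intercept=1.0, a zero slope is still a
# non-null design: the dose-weighted truth is 1.0, not 0.0.
truth = true_overall_att([0.0, 0.5, 1.0], att_intercept=1.0, att_slope=0.0)
```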

Code Quality

  • Severity: P1. Impact: The new bisection helpers do not correctly handle explicit search ranges. When effect_range / n_range is provided, the code just assigns lo, hi and skips endpoint evaluation. best_power is then initialized from an empty search_path to 0.0, and simulate_sample_size() can even exit before simulating anything when hi - lo <= 2. That means these functions can return an unchecked upper bound labeled as mde / required_n, together with fabricated power_at_mde / power_at_n = 0.0. Concrete fix: evaluate both endpoints whenever a bracket is supplied, require or explicitly warn/error if the target is not bracketed, and only report power_at_* for values that were actually simulated. Evidence: diff_diff/power.py:1842, diff_diff/power.py:1879, diff_diff/power.py:2016, diff_diff/power.py:2033, diff_diff/power.py:2036
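
The endpoint-evaluation fix can be sketched as a guard that runs before bisection; validate_bracket and power_fn are hypothetical names for illustration:

```python
def validate_bracket(power_fn, lo, hi, target):
    """Evaluate both bracket endpoints before bisecting.

    Raises when the target power is not actually bracketed, so the
    search can never return an unevaluated upper bound or report a
    fabricated power of 0.0 for a point it never simulated.
    """
    power_lo, power_hi = power_fn(lo), power_fn(hi)
    if power_lo >= target:
        raise ValueError(f"lower bound already reaches target power ({power_lo:.2f})")
    if power_hi < target:
        raise ValueError(f"upper bound never reaches target power ({power_hi:.2f})")
    return power_lo, power_hi

# Toy power curve: the bracket [0.1, 1.5] straddles target power 0.8.
power_lo, power_hi = validate_bracket(lambda e: min(1.0, e), lo=0.1, hi=1.5, target=0.8)
```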

Performance

No material findings.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate tech-debt-only findings. The issues above are not tracked in TODO.md:38, so they remain unmitigated for assessment purposes.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The added tests do not exercise the new failure modes. There is an existing DID null-effect calibration test, but the new ContinuousDiD coverage only checks that a result object is finite, and all new MDE/sample-size tests are DID-only. There is also no test for explicit effect_range / n_range inputs. Concrete fix: add tests that assert ContinuousDiD null behavior is calibrated or explicitly unsupported, plus tests that explicit brackets are evaluated/validated and that unbracketed ranges fail loudly. Evidence: tests/test_power.py:390, tests/test_power.py:980, tests/test_power.py:1001, tests/test_power.py:1106

Review note: I could not execute the test suite in this environment because the local Python environment is missing numpy, so this is a static diff review.

igerber added a commit that referenced this pull request Mar 18, 2026
…bisection brackets

P0: Remove ContinuousDiD from power analysis registry — the DGP simulated
att_slope as treatment_effect but extracted overall_att (a dose-weighted
average), so the simulated truth and evaluated estimand didn't match.

P1: Add endpoint evaluation for user-supplied effect_range/n_range in
simulate_mde() and simulate_sample_size(). Previously these paths skipped
evaluation entirely, leaving search_path empty and best_power at 0.0.
Now warns when brackets don't contain the target power.

P2: Add regression tests for all three fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 11539e7a01a24a990ff517c75c04099557578681


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior ContinuousDiD blocker is resolved: it is no longer auto-registered for simulation power, and there is now a regression test for that path.
  • The new automatic TwoWayFixedEffects registry entry is still methodologically wrong for the default multi-period simulation path. It passes time="period" into an estimator that explicitly warns this does not estimate the standard DiD ATT.
  • The explicit-endpoint fix is only partial. simulate_sample_size() still never uses the lower-endpoint power, so it can return an oversized required_n when the lower bound already achieves target power.
  • The new tests are mostly smoke tests. They do not lock down either the TWFE methodology issue or the lower-bound sample-size bug.

Methodology

  • Severity: P1. Impact: the new TWFE registry profile is not aligned with the estimator’s documented estimand. _twfe_fit_kwargs() wires automatic power to time="period", but TwoWayFixedEffects.fit() explicitly warns that multi-valued time inputs estimate treated * period_number, not the standard treated * post ATT. That means the new “all built-in estimators are supported automatically” claim is false for TWFE under the default n_periods=4, and the resulting simulate_power(), simulate_mde(), and simulate_sample_size() outputs for TWFE are not methodologically valid. Concrete fix: do not auto-register TWFE for multi-period power simulation in its current form. Either restrict automatic TWFE support to validated two-period / binary-post designs, or change the estimator/power interface so time fixed effects and the post indicator are passed separately before claiming automatic support. Evidence: diff_diff/power.py:L139, diff_diff/power.py:L267, diff_diff/power.py:L1224, diff_diff/twfe.py:L96, docs/methodology/REGISTRY.md:L238, docs/methodology/REGISTRY.md:L253
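
The binary-post restriction the review asks for can be sketched in plain Python; post_indicator and treat_period are hypothetical illustration names:

```python
def post_indicator(periods, treat_period):
    """Collapse a multi-valued period column into a 0/1 post indicator.

    Fitting with time="post" makes the TWFE interaction treated * post
    (the standard DiD ATT) rather than treated * period_number.
    `treat_period` is the first post-treatment period (an assumption here).
    """
    return [int(p >= treat_period) for p in periods]

# Four periods with treatment starting at period 2.
post = post_indicator([0, 1, 2, 3], treat_period=2)
```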

Code Quality

  • Severity: P1. Impact: simulate_sample_size() still violates the basic bisection invariant because it evaluates the lower endpoint only for logging and never checks whether that lower bound already meets target power. If lo already satisfies the target, the function still bisects above it and can silently return a too-large required_n; the same bug exists in the auto-bracketing path when min_n is already sufficient. Concrete fix: evaluate and store power_lo, return lo immediately when power_lo >= power, and only enter bisection once the invariant power(lo) < target <= power(hi) holds. Evidence: diff_diff/power.py:L1988, diff_diff/power.py:L1999, diff_diff/power.py:L2016
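
The invariant the review describes can be sketched as an integer bisection that evaluates both endpoints up front; all names are hypothetical:

```python
def bisect_sample_size(power_fn, lo, hi, target=0.8):
    """Smallest n with power_fn(n) >= target, assuming monotone power.

    Both endpoints are evaluated first: the search returns lo when it
    already meets the target, and errors when hi falls short, so the
    bisection only runs under power(lo) < target <= power(hi).
    """
    if power_fn(lo) >= target:
        return lo  # short-circuit: lower bound is already sufficient
    if power_fn(hi) < target:
        raise ValueError("upper bound never reaches target power")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_fn(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

# Toy monotone curve reaching 80% power at n = 400.
required_n = bisect_sample_size(lambda n: min(1.0, n / 500), lo=50, hi=2000)
# A strong effect where lo already suffices short-circuits to 50.
n_short = bisect_sample_size(lambda n: 0.95, lo=50, hi=2000)
```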

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate deferred-work findings. These issues are not tracked in the code-review debt section of TODO.md:L38, so they remain unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests would not catch either P1. test_twfe() only asserts that the call returns a finite-looking result, not that the TWFE registry path is methodologically valid, and the sample-size tests never cover the case where the lower search bound already achieves target power. The simulate_power() docstring also overstates support by saying all built-in estimators are automatically supported. Concrete fix: add regression tests that either reject unsupported multi-period TWFE power simulation or validate a corrected two-period TWFE path, plus tests where simulate_sample_size() must return the lower endpoint for both explicit n_range and auto-bracketed strong-effect cases; narrow the docstring claim until that coverage exists. Evidence: tests/test_power.py:L844, tests/test_power.py:L1177, tests/test_power.py:L1201, diff_diff/power.py:L1224

Path to Approval

  1. Restrict or redesign automatic TWFE power support so it does not run the current multi-period time="period" path and report it as standard ATT power.
  2. Fix simulate_sample_size() to use both bracket endpoints and short-circuit when the lower bound already meets target power.
  3. Add regression tests for the two cases above, including an explicit lower-bound-sufficient sample-size search.

Static diff review only. I could not execute the test suite in this environment because the local Python environment is missing numpy.

igerber added a commit that referenced this pull request Mar 18, 2026
…short-circuit

- Remove TwoWayFixedEffects from power analysis registry (time="period"
  produces treated*period_number, not standard ATT)
- Add early return in simulate_sample_size() when lower bound already
  achieves target power (both explicit n_range and auto-bracket paths)
- Narrow docstring from "All" to "Most" built-in estimators
- Add regression tests for TWFE exclusion and lo-sufficient scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d79b042bd80e8d8a1afa453ae8397f4fb2e95b37


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

The two prior blockers from the last review appear resolved: TwoWayFixedEffects is no longer auto-registered for simulation power, and simulate_sample_size() now short-circuits when the lower bound already meets target power, with regression coverage for both cases diff_diff/power.py:L249-L331, tests/test_power.py:L803-L813, diff_diff/power.py:L1972-L2003, tests/test_power.py:L1229-L1256. One new P1 remains.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings in the changed code.

Maintainability

No material findings.

Tech Debt

No separate deferred-work findings. The remaining P1 is not tracked under the code-review debt section of TODO.md:L38-L64, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new test coverage does not exercise the public default SyntheticDiD auto-support path. test_synthetic_did() only uses treatment_fraction=0.3, and the new MDE/sample-size tests only cover DifferenceInDifferences, so the current default-path regression would not be caught by the suite. Concrete fix: add regression tests for simulate_power(SyntheticDiD(), progress=False) at defaults, plus corresponding simulate_mde() / simulate_sample_size() coverage, and assert either valid finite inference or an explicit fail-fast contract. References: tests/test_power.py:L959-L970, tests/test_power.py:L978-L1103, tests/test_power.py:L1111-L1256.

Path to Approval

  1. Make the automatic SyntheticDiD simulation path valid or explicitly unsupported. Concretely: detect the default placebo-variance case where n_control <= n_treated and either raise with a clear message, or switch the supported automatic configuration to one that yields valid inference.
  2. Add regression tests covering the public default SyntheticDiD path in simulate_power(), and the same configuration through simulate_mde() and simulate_sample_size().

Static review only. I could not run the test suite in this environment because pytest is unavailable, and local Python imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…ro-power bug

Add _sdid_dgp_kwargs that caps treatment_fraction at 0.4 so the placebo
variance method has enough pseudo-controls (n_control > n_treated).
Add regression test for the default SyntheticDiD path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f2b2b52bb09b1d35bcbe6a9c5b5f8dab5e3ee8


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology blocker above.

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the test gap below.

Tech Debt

No relevant entry under TODO.md:L38-L64 mitigates the blocker above, so it remains unmitigated for assessment.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests validate that the default SyntheticDiD path no longer returns zero power, but they do not verify that the requested design was preserved, and they do not exercise SyntheticDiD through simulate_mde() or simulate_sample_size(). That is why the silent treatment_fraction rewrite passes the suite. Concrete fix: add public-API regression tests that assert either (a) the realized treated share matches the requested treatment_fraction, or (b) unsupported placebo designs raise/warn explicitly; run the same contract through simulate_power(), simulate_mde(), and simulate_sample_size() for SyntheticDiD. References: tests/test_power.py:L697-L729, tests/test_power.py:L975-L987, tests/test_power.py:L995-L1269.

Path to Approval

  1. Remove the unconditional 40% cap from _sdid_dgp_kwargs() so the simulation respects the caller’s treatment_fraction.
  2. Add an explicit contract for unsupported SyntheticDiD placebo designs: if n_control <= n_treated, raise with a clear message or require an explicit bootstrap configuration, rather than silently changing the design.
  3. Add regression coverage for SyntheticDiD across simulate_power(), simulate_mde(), and simulate_sample_size() that checks the contract above.
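
The fail-fast contract in item 2 can be sketched as an upfront check; the function name and message wording are illustrative, not the library's actual API:

```python
def check_placebo_design(n_units, treatment_fraction):
    """Validate a placebo-variance design before simulating.

    Placebo variance needs strictly more controls than treated units;
    rather than silently capping treatment_fraction, fail with
    actionable guidance.
    """
    n_treated = round(n_units * treatment_fraction)
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"Placebo variance requires n_control > n_treated "
            f"(got {n_control} <= {n_treated}). Lower treatment_fraction "
            "or use bootstrap variance."
        )
    return n_treated, n_control

counts = check_placebo_design(100, 0.3)  # valid: 30 treated, 70 controls
```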

Static review only. I could not run pytest here because pytest is unavailable, and local imports fail because numpy is not installed.

igerber added a commit that referenced this pull request Mar 18, 2026
…n cap with fail-fast validation

Remove _sdid_dgp_kwargs that silently capped treatment_fraction at 0.4.
Add upfront ValueError in simulate_power() when SyntheticDiD placebo
variance has n_control <= n_treated, with actionable error message
suggesting lowering treatment_fraction or using bootstrap variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe4a7ab2714d25bd64c6742359b04631c77e2e1


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the code now fails fast for placebo designs with n_control <= n_treated, the behavior is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • [Newly identified] The registry refactor leaves TwoWayFixedEffects outside the automatic power-analysis path, so the default simulation APIs now error for a first-class public estimator. The new tests explicitly lock that unsupported state in.
  • The new "provide a custom data_generator and estimator_kwargs" fallback is not actually generic. Unregistered estimators with non-basic fit() signatures, such as ContinuousDiD, still receive hard-coded basic-DiD kwargs and cannot use the advertised escape hatch.
  • Test coverage is broad for registered estimators and the SyntheticDiD fail-fast path, but it adds no positive regression coverage for TwoWayFixedEffects and no test of the advertised custom fallback for non-basic estimators.

Methodology

Code Quality

  • Severity: P1. Impact: TwoWayFixedEffects is still exported as a core public estimator, but the new registry omits it and simulate_power() now raises "not in registry"; simulate_mde() and simulate_sample_size() inherit the same failure. The suite then codifies that unsupported state. This is an undocumented contraction of the default power-analysis API introduced by the refactor. Concrete fix: add a dedicated TwoWayFixedEffects profile to _get_registry() and cover all three public simulation entry points. When you add it, use a binary post indicator in the fit kwargs (time="post", unit="unit"), not the old time="period" pattern, because the registry explicitly says multi-valued time does not estimate the standard TWFE ATT. References: diff_diff/__init__.py:L37-L42, diff_diff/__init__.py:L188-L200, diff_diff/power.py:L249-L331, diff_diff/power.py:L1298-L1307, docs/methodology/REGISTRY.md:L253-L257, tests/test_power.py:L803-L814

Performance

No material findings in the changed code.

Maintainability

  • Severity: P2. Impact: The new fallback contract for unregistered estimators is incomplete. The error path tells users to "Provide a custom data_generator and estimator_kwargs," but when profile is None the code still injects treatment="treated" and time="post" unconditionally. Estimators with non-basic fit() signatures, such as ContinuousDiD.fit(data, outcome, unit, time, first_treat, dose, ...), cannot actually use the advertised escape hatch. Concrete fix: either add a truly generic fit_kwargs/fit_kwargs_builder override that lets callers fully control the fit signature, or raise a targeted error for non-basic unregistered estimators instead of suggesting a path that still fails. References: diff_diff/power.py:L1302-L1307, diff_diff/power.py:L1408-L1418, diff_diff/continuous_did.py:L153-L161

Tech Debt

No relevant TODO.md entry mitigates the P1 or P2 findings above. Reference: TODO.md:L38-L64

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests institutionalize TwoWayFixedEffects as unsupported instead of guarding against the regression, and they do not exercise the advertised custom-generator fallback for an unregistered non-basic estimator. That leaves both API-contract issues above unguarded. Concrete fix: replace the negative TwoWayFixedEffects test with positive coverage for simulate_power(), simulate_mde(), and simulate_sample_size(), and add one regression test that either proves the custom fallback works for a non-basic estimator or asserts a targeted "unsupported fit signature" error. References: tests/test_power.py:L803-L814, tests/test_power.py:L834-L1067, tests/test_power.py:L1075-L1352

Path to Approval

  1. Add a TwoWayFixedEffects registry entry so the default simulation APIs cover that public estimator.
  2. Make the unregistered-estimator fallback honest: either support fully caller-specified fit kwargs or fail with a targeted error instead of hard-coding basic-DiD kwargs.
  3. Add regression coverage for TwoWayFixedEffects across simulate_power(), simulate_mde(), and simulate_sample_size(), plus one test covering the intended behavior for an unregistered non-basic estimator.

Static review only. I did not execute the test suite in this environment.

igerber added a commit that referenced this pull request Mar 18, 2026
…tor fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d2fdaf892bbe20fd99ce97b96ba339bc9121ff3


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Static review only. I could not execute the test suite in this environment because pytest and the scientific Python deps are not installed.

Executive Summary

  • The prior SyntheticDiD blocker is resolved: the simulation APIs now fail fast when placebo variance would violate n_control > n_treated, that deviation is documented in the Methodology Registry, and tests cover simulate_power(), simulate_mde(), and simulate_sample_size().
  • The prior TwoWayFixedEffects blocker is resolved: TWFE is back in the estimator registry, and the registry now uses a binary post indicator, which matches the Registry guidance for standard TWFE ATT.
  • The prior “custom fallback” concern is only partially resolved: caller-controlled estimator_kwargs fixed the old fit-signature problem, but unregistered estimators with nonstandard result objects still cannot use the advertised escape hatch.
  • The power docs remain stale and unguarded: docs/api/power.rst still documents nonexistent PowerAnalysis.compute_* methods and old simulate_power() arguments, and the new snippet smoke test never exercises that file.
  • The new TODO.md entry correctly tracks the remaining blanket-NameError limitation in doc-snippet smoke tests, so that item is informational only.

Methodology

Code Quality

  • No material findings.

Performance

  • No material findings.

Maintainability

  • Severity: P2. Impact: The advertised unregistered-estimator fallback is still not truly generic. When profile is None, simulate_power() now lets callers fully control estimator.fit() kwargs, but result extraction still assumes either .att/.se/.p_value/.conf_int or .avg_*. Public unregistered estimators with different result schemas, such as ContinuousDiD (overall_att, overall_att_se, overall_att_p_value, overall_att_conf_int), will still fail even if the caller supplies a correct custom DGP wrapper and fit kwargs. That means the previous “generic fallback” concern is only partially resolved. Concrete fix: add a caller-supplied result_extractor hook, or extend the fallback extractor to handle overall_* result objects; then add a regression test using ContinuousDiD with explicit data_generator and estimator_kwargs. References: diff_diff/power.py:L1318-L1446, diff_diff/continuous_did.py:L153-L161, diff_diff/continuous_did_results.py:L112-L118, tests/test_power.py:L1097-L1138
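
The result_extractor hook the review proposes can be sketched as a simple strategy override. The attribute names mirror the review's description of the standard and overall_* schemas; everything else is a hypothetical illustration:

```python
from types import SimpleNamespace

def default_extractor(result):
    """Pull (estimate, se, p_value, conf_int) from a standard result."""
    return result.att, result.se, result.p_value, result.conf_int

def extract(result, result_extractor=None):
    """Use a caller-supplied extractor for nonstandard result schemas."""
    return (result_extractor or default_extractor)(result)

# A ContinuousDiD-style result exposes overall_* fields instead.
continuous = SimpleNamespace(
    overall_att=1.2, overall_att_se=0.3,
    overall_att_p_value=0.01, overall_att_conf_int=(0.6, 1.8),
)
est, se, p, ci = extract(
    continuous,
    result_extractor=lambda r: (
        r.overall_att, r.overall_att_se,
        r.overall_att_p_value, r.overall_att_conf_int,
    ),
)
```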

Tech Debt

  • Severity: P3. Impact: tests/test_doc_snippets.py still treats any NameError as a pass, which can mask standalone snippet breakage, but this PR explicitly tracks that limitation in TODO.md, so it is mitigated and non-blocking under the review policy. Concrete fix: none required for approval; future work is to replace the blanket NameError catch with a per-snippet allowlist or explicit fixture setup. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L62-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: The power documentation is still out of sync with the implementation, and the new smoke-test coverage does not protect it. docs/api/power.rst still advertises nonexistent PowerAnalysis.compute_* methods and stale simulate_power() parameters, while the newly exported simulate_mde, simulate_sample_size, SimulationMDEResults, and SimulationSampleSizeResults are absent from the API index. Because tests/test_doc_snippets.py does not include docs/api/power.rst, this broken documentation path remains untested. Concrete fix: update docs/api/power.rst to the current API (power, mde, sample_size, simulate_mde, simulate_sample_size, and both new result classes), add the new symbols to docs/api/index.rst and autosummary, and include api/power.rst in RST_FILES. References: docs/api/power.rst:L33-L65, docs/api/power.rst:L145-L160, docs/api/index.rst:L148-L155, tests/test_doc_snippets.py:L25-L40, diff_diff/__init__.py:L53-L64, diff_diff/__init__.py:L295-L306


igerber commented Mar 18, 2026

/ai-review

igerber and others added 7 commits March 18, 2026 15:00
…sample size

- Add estimator registry mapping all 12 estimators to appropriate DGP,
  fit kwargs, and result extraction profiles
- Refactor simulate_power() to use registry lookup instead of hardcoded
  if/elif chain (was 4 estimators, now 12 + custom fallback)
- Add simulate_mde() for bisection search over effect sizes
- Add simulate_sample_size() for bisection search over n_units
- Add SimulationMDEResults and SimulationSampleSizeResults dataclasses
- 34 new tests: registry, estimator coverage, MDE search, sample size search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API symbols

- Add `result_extractor` parameter to simulate_power, simulate_mde, and
  simulate_sample_size for unregistered estimators with non-standard schemas
- Fix power.rst: correct PowerAnalysis method names, example code, and add
  SimulationMDEResults/SimulationSampleSizeResults/simulate_mde/simulate_sample_size
- Add 4 missing symbols to docs/api/index.rst autosummary
- Add api/power.rst to doc snippet smoke tests
- Add tests for custom result_extractor and MDE forwarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a15ce5a4d7a4a318b9edb024152b61cf8307004c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review against the provided diff and surrounding source context; I did not execute the test suite in this environment.

Executive Summary

  • Prior blockers from the last review appear addressed: TwoWayFixedEffects is back in the power registry with a binary post path, and the SyntheticDiD placebo fail-fast behavior is now documented in the Methodology Registry.
  • One unmitigated P1 remains in the new automatic power-analysis path: the staggered-estimator registry hardcodes a single-cohort + never-treated DGP and does not adapt to estimator-level settings like control_group, clean_control, or anticipation.
  • That means simulate_power(), and by extension simulate_mde() / simulate_sample_size(), can silently report power for a different design than the configured estimator object the caller passed.
  • The new power tests mostly cover default estimator instances, so those parameter-dependent staggered paths are not exercised.
  • The doc-snippet NameError masking limitation is now tracked in TODO.md, so it is informational only.

Methodology

Code Quality

  • No material findings in the changed code beyond the methodology issue above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • Severity: P3. Impact: the new doc-snippet smoke test still treats any NameError as a pass, which can mask standalone example breakage, but this PR also adds that limitation to TODO.md, so it is properly tracked and non-blocking under the review policy. Concrete fix: none required for approval; later, replace blanket NameError suppression with per-snippet setup or an allowlist. References: tests/test_doc_snippets.py:L357-L363, TODO.md:L60-L65

Security

  • No material findings.

Documentation/Tests

  • Severity: P2. Impact: the new estimator-coverage tests only instantiate default estimator objects, so the regression suite never exercises the new auto-DGP path for control_group="not_yet_treated", clean_control="strict"/"not_yet_treated", or anticipation>0. That leaves the methodology mismatch above unguarded. Concrete fix: add explicit simulation tests for those configurations across simulate_power(), simulate_mde(), and simulate_sample_size(). References: tests/test_power.py:L828-L1095

Path to Approval

  1. Update the staggered power registry so the default DGP matches estimator settings that change identification/comparison design (control_group, clean_control, cohort structure, anticipation), or refuse those configurations unless callers supply matching data_generator_kwargs / data_generator.
  2. Add regression tests for at least CallawaySantAnna(control_group="not_yet_treated") with data_generator_kwargs={"never_treated_frac": 0.0, "cohort_periods": [2, 4]}, StackedDiD(clean_control="strict", kappa_pre=1, kappa_post=1) with later-treated clean controls, and one anticipation>0 staggered estimator case where the DGP matches the shifted effect onset/reference period.
  3. If any automatic-simulation limitation remains intentional, document it in docs/methodology/REGISTRY.md with a Note or Deviation label and emit a runtime warning when the registry default DGP is not faithful to the configured estimator.

Add _check_staggered_dgp_compat() that detects control_group="not_yet_treated",
clean_control="strict", and anticipation>0 on staggered estimators using the
automatic single-cohort DGP, and emits UserWarning with specific guidance on
supplying data_generator_kwargs or a custom data_generator.

- 5 warning/no-warning unit tests + 2 slow regression tests with matching DGP
- Document limitation in REGISTRY.md PowerAnalysis section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
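The compat check described in this commit could look roughly like the following sketch. The helper name (the leading underscore is dropped here), the estimator attribute names, and the warning texts are assumptions based on the review discussion, not the actual diff_diff implementation.

```python
import warnings


def check_staggered_dgp_compat(estimator, data_generator_kwargs=None,
                               data_generator=None):
    """Warn when a staggered estimator's settings diverge from the automatic DGP.

    Hypothetical sketch: attribute names (control_group, clean_control,
    anticipation) follow the review text; the real helper is
    _check_staggered_dgp_compat() in diff_diff/power.py.
    """
    if data_generator is not None:
        return  # custom DGP supplied; the caller owns design fidelity
    overrides = data_generator_kwargs or {}
    if (getattr(estimator, "control_group", "never_treated") == "not_yet_treated"
            and "never_treated_frac" not in overrides):
        warnings.warn(
            "control_group='not_yet_treated' but the automatic DGP keeps a "
            "never-treated pool; pass data_generator_kwargs="
            "{'never_treated_frac': 0.0, 'cohort_periods': [...]} or a custom "
            "data_generator.",
            UserWarning,
        )
    if getattr(estimator, "clean_control", "all") == "strict" and not overrides:
        warnings.warn(
            "clean_control='strict' has no later-treated clean controls under "
            "the single-cohort automatic DGP; supply cohort_periods via "
            "data_generator_kwargs.",
            UserWarning,
        )
    if getattr(estimator, "anticipation", 0) > 0:
        warnings.warn(
            "anticipation>0 is not reflected in the automatic DGP's effect "
            "onset; supply a data_generator with a shifted onset.",
            UserWarning,
        )
```

A default-configured estimator (never-treated controls, no anticipation) passes through silently, matching the "no-warning" unit tests the commit mentions.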
@igerber
Copy link
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 800d37f224f76b2218d164098a3fdec8d3797e6d


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review of the provided diff and surrounding source; I did not execute the test suite in this environment because the package import fails without numpy.

Executive Summary

  • The prior staggered-registry blocker is mitigated: the single-cohort auto-DGP limitation is now documented in the Methodology Registry, surfaced as UserWarnings, and covered by regression tests.
  • One P1 remains in the new SyntheticDiD power path: the placebo-variance fail-fast ignores data_generator_kwargs overrides that change the effective treated/control split.
  • That allows simulate_power(), simulate_mde(), and simulate_sample_size() to return numeric output for an infeasible placebo design instead of raising the documented ValueError.
  • The new SDID tests only cover the treatment_fraction boundary, so the override-based regression is not exercised.
  • I did not find additional material issues in code quality, performance, maintainability, security, or docs beyond that gap.

Methodology

Code Quality

No material findings in the changed code beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings.

Tech Debt

No material findings newly introduced by this diff.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new SDID regression coverage only checks the placebo boundary through treatment_fraction; it never exercises the supported data_generator_kwargs override path that changes the actual treated count, so the P1 above shipped untested. Concrete fix: add simulate_power(), simulate_mde(), and simulate_sample_size() tests using SyntheticDiD(), n_units=50, treatment_fraction=0.3, and data_generator_kwargs={"n_treated": 30} (or any override that makes n_control <= n_treated) and assert the same ValueError as the high-treatment_fraction path. References: tests/test_power.py:L968-L1030, diff_diff/power.py:L1409-L1500.

Path to Approval

  1. Move the SyntheticDiD placebo-feasibility validation to the effective DGP parameters after data_generator_kwargs are merged, so registry-based overrides of n_treated cannot bypass the documented ValueError.
  2. Add regression tests for all three simulation helpers covering the override case, not just the treatment_fraction case.
  3. If the fail-fast guarantee is intentionally limited to the default factor DGP, narrow the REGISTRY.md note/error text accordingly; otherwise keep the current note and make the validation match it.

… override

Move the SyntheticDiD placebo-variance fail-fast to after the kwargs merge so
that data_generator_kwargs={"n_treated": N} overrides are caught. Add tests for
the override path across simulate_power, simulate_mde, and simulate_sample_size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
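The merge-then-validate ordering in this commit can be sketched as follows. The function name, the `n_treated` key, and the default treated-count formula are illustrative assumptions, not the actual diff_diff code.

```python
def effective_sdid_design(n_units, treatment_fraction, data_generator_kwargs=None):
    """Validate SDiD placebo feasibility on the *merged* design.

    Sketch of the ordering fix: merge registry defaults with caller overrides
    first, then check n_control > n_treated, so an override of n_treated
    cannot bypass the documented ValueError.
    """
    defaults = {"n_treated": max(1, round(n_units * treatment_fraction))}
    merged = {**defaults, **(data_generator_kwargs or {})}
    n_treated = merged["n_treated"]
    n_control = n_units - n_treated
    if n_control <= n_treated:
        raise ValueError(
            f"SyntheticDiD placebo variance requires n_control > n_treated "
            f"(got {n_control} control vs {n_treated} treated)"
        )
    return merged
```

With the review's example (n_units=50, treatment_fraction=0.3, override n_treated=30), the defaults alone pass but the merged design fails, which is exactly the case the pre-fix code let through.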
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9e18a3ea05f874da0a9bc63f1f66fc0eb5f2f343


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. Static review only; I could not execute the test suite in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings.

Maintainability

No material findings beyond the need to make the TripleDifference adapter semantics explicit.

Tech Debt

No separate tech-debt-only findings. The TripleDifference adapter issue above is not tracked in TODO.md, so it remains unmitigated for approval purposes. See TODO.md:L38.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new test coverage does not exercise the DDD adapter’s parameter translation edge cases. TripleDifference is only tested in simulate_power() with n_units=80, which happens to map cleanly to 10 observations per cell, and there are no DDD-specific simulate_mde() or simulate_sample_size() cases. That leaves the P1 above unguarded. Concrete fix: add TripleDifference regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() with values like n_units=10 and n_units=65, and assert the intended behavior for unsupported n_periods/treatment_period/treatment_fraction and for effective sample-size handling. See tests/test_power.py:L930, tests/test_power.py:L1353, tests/test_power.py:L1486.

Path to Approval

  1. Make the TripleDifference registry path enforce a one-to-one mapping between public simulation inputs and the actual DDD DGP: validate or reject unsupported n_periods, treatment_period, and treatment_fraction, and stop silently rounding n_units without surfacing the effective design.
  2. Add DDD-specific regression tests for simulate_power(), simulate_mde(), and simulate_sample_size() covering non-multiples of 8 and unsupported generic design knobs.
  3. If any DDD-specific rounding or fixed-2x2x2 behavior is intentional, document it in docs/methodology/REGISTRY.md with a **Note:** and mirror that behavior in the public docstrings/results so the simulated design is explicit to callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
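A strict version of the DDD unit mapping requested in point 1 would reject non-multiples of the 8-cell design instead of silently rounding. This is a hypothetical sketch (`ddd_effective_n` is not a real diff_diff function); the 2×2×2 cell count and the n_units=80 → 10-per-cell mapping follow the review text.

```python
def ddd_effective_n(n_units, cells=8):
    """Map a requested n_units onto the fixed 2x2x2 DDD design.

    Sketch: instead of silently rounding n_units, reject values that do not
    divide evenly across the 8 cells, surfacing the effective design.
    """
    if n_units % cells != 0:
        lower = (n_units // cells) * cells
        raise ValueError(
            f"n_units={n_units} is not divisible by the {cells} cells of the "
            f"2x2x2 DDD design; nearest valid values are {lower} and "
            f"{lower + cells}."
        )
    return n_units // cells  # observations per cell
```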
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0e912bbed172b3958c4e87088e35665c9a0b17c5


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only; I could not execute pytest in this environment because both pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: simulate_sample_size() and SimulationSampleSizeResults.required_n are documented as returning the required/smallest sample size, but the auto path seeds lo from an internal registry floor and returns immediately if that floor already reaches target power. Because that branch never evaluates any n < min_n, it can silently report a non-minimal required_n. This is a wrong statistical output, not just a conservative implementation choice. Concrete fix: use min_n only as an initial guess for bracketing, then search downward to the estimator’s true feasibility floor before returning; if you intend to return an upper bound instead, rename/document the result accordingly and emit a warning. See diff_diff/power.py:L1774-L1777, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2132-L2204, tests/test_power.py:L1747-L1758.
  • Severity: P1. Impact: _check_ddd_dgp_compat() suppresses all ignored-parameter warnings whenever n_per_cell is present in data_generator_kwargs, even though overriding n_per_cell does not make n_periods, treatment_period, or treatment_fraction meaningful to generate_ddd_data(). That leaves a silent design-mismatch path for DDD simulations and conflicts with the registry note promising a warning whenever inputs differ from the effective design. Concrete fix: let n_per_cell suppress only the effective-N rounding/clamping warning; keep warning on ignored n_periods, treatment_period, and treatment_fraction unless the caller supplies a fully custom data_generator. See diff_diff/power.py:L313-L359, docs/methodology/REGISTRY.md:L1721, tests/test_power.py:L1023-L1035.
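The downward-search fix suggested for simulate_sample_size() can be sketched generically. Here `power_at` stands in for the simulated power objective (assumed roughly monotone in n), `min_n` is the registry floor used only as an initial guess, and `hard_floor` is the estimator's true feasibility limit; all names are illustrative, not the diff_diff API.

```python
def search_required_n(power_at, target_power, min_n, hard_floor=4):
    """Smallest n with power_at(n) >= target_power, assuming monotone power.

    Sketch of the review's suggested fix: when the registry floor already
    reaches target power, bisect *below* it down to hard_floor instead of
    returning min_n as the required sample size.
    """
    if power_at(min_n) >= target_power:
        # Floor already suffices: rule out smaller n before returning it.
        lo, hi = hard_floor, min_n
    else:
        # Bracket upward by doubling until target power is reached.
        n = min_n
        while power_at(n) < target_power:
            n *= 2
        lo, hi = n // 2, n
    # Bisect for the smallest feasible n in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if power_at(mid) >= target_power:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Because both branches feed the same bisection, the answer no longer depends on whether the initial guess landed above or below the true minimum.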

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the need to separate “initial search guess” from “true minimum supported N” in the sample-size search path.

Tech Debt

  • Severity: P3. Impact: The earlier TripleDifference power-analysis limitation is now properly tracked in TODO.md and documented in REGISTRY.md, so that specific deviation is mitigated under the project’s review policy. Concrete fix: none for this PR. See TODO.md:L42-L50, docs/methodology/REGISTRY.md:L1720-L1721.

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests lock in the two behaviors above instead of the documented contract: test_lo_already_sufficient_auto asserts the auto path should return the registry floor as required_n, and test_ddd_no_warn_n_per_cell_override asserts no warning in a still-ignored-parameter DDD path. That will hide the fixes behind failing tests. Concrete fix: update those assertions to match the corrected behavior, and add a DDD regression where n_per_cell is overridden but mismatched n_periods/treatment_fraction still warn. See tests/test_power.py:L1023-L1035, tests/test_power.py:L1747-L1758.

Path to Approval

  1. Fix simulate_sample_size() so the auto path does not return min_n as required_n without ruling out smaller sample sizes; either search downward after the first successful bracket or explicitly convert the API/result semantics to “upper bound” with a warning.
  2. Fix _check_ddd_dgp_compat() so n_per_cell only suppresses the effective-sample-size warning, not the warnings for ignored n_periods, treatment_period, and treatment_fraction.
  3. Update the new tests to reflect the corrected contracts, especially tests/test_power.py:L1023-L1035 and tests/test_power.py:L1747-L1758.

… warning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fa93beae35118378807aacbcecad8c3522477b21


Overall Assessment

⛔ Blocker — Highest unmitigated severity: P0.

Static review only. I could not run pytest or import the package in this environment because pytest and numpy are unavailable.

Executive Summary

Methodology

  • Severity: P0. Impact: On the registered-estimator path, dgp_kwargs.update(data_gen_kwargs) silently lets data_generator_kwargs override builder-supplied core simulation inputs. A conflicting key such as treatment_effect will make simulate_power() report one true_effect while generating data with another, which in turn corrupts bias, coverage, and any simulate_mde() / simulate_sample_size() search that relies on simulate_power() as its objective. Before this refactor, duplicate kwargs on the explicit call path would have raised immediately; now they silently change the estimand/DGP. Concrete fix: reject conflicting overrides on the registry path for core/search-driving inputs instead of blindly update()-ing, with a narrow allowlist only for documented, intentional overrides such as DDD n_per_cell. Add regression tests for collisions on treatment_effect and the searched quantity. Location: diff_diff/power.py:L1554-L1564, diff_diff/power.py:L1645-L1664, diff_diff/power.py:L1933-L1952, diff_diff/power.py:L2137-L2156
  • Severity: P1. Impact: simulate_sample_size() still claims to find the “smallest N” / “required sample size,” but in auto mode it returns the registry floor as soon as that floor reaches target power, without evaluating any smaller n. The added warning helps, but the returned required_n semantics remain incorrect and undocumented in the registry. Concrete fix: either actually search below the registry floor until infeasibility is established, or rename/document the auto-path result everywhere as an upper bound/search floor rather than required_n. Location: diff_diff/power.py:L1774-L1776, diff_diff/power.py:L1818-L1828, diff_diff/power.py:L2072-L2075, diff_diff/power.py:L2191-L2209
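The collision-rejection fix from the first bullet can be sketched as follows. The helper name, the allowlist contents, and the error wording are assumptions for illustration; only the idea (reject conflicting core keys rather than `dict.update()` them) comes from the review.

```python
def merge_dgp_kwargs(builder_kwargs, data_generator_kwargs,
                     allow_override=("n_per_cell",)):
    """Merge caller overrides into builder-supplied DGP kwargs, rejecting
    collisions on core simulation inputs.

    Sketch: only documented keys (e.g. DDD's n_per_cell) may override;
    a conflicting key like treatment_effect raises instead of silently
    changing the simulated estimand.
    """
    overrides = data_generator_kwargs or {}
    conflicts = set(builder_kwargs) & (set(overrides) - set(allow_override))
    if conflicts:
        raise ValueError(
            f"data_generator_kwargs would silently override core simulation "
            f"inputs {sorted(conflicts)}; set them via the simulate_* "
            f"arguments instead."
        )
    return {**builder_kwargs, **overrides}
```

This restores the pre-refactor behavior where duplicate kwargs failed immediately, while keeping the intentional DDD `n_per_cell` override path open.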

Code Quality

No material findings beyond the correctness issues above.

Performance

No material findings.

Maintainability

No material findings beyond the unresolved required_n contract mismatch in simulate_sample_size().

Tech Debt

  • Severity: P3. Impact: The TripleDifference power-analysis limitation is now properly mitigated under project policy: the fixed 2×2×2 DGP is documented in the Methodology Registry and tracked in TODO.md, so that deviation is informational rather than blocking. Concrete fix: none for this PR. Location: docs/methodology/REGISTRY.md:L1720-L1721, TODO.md:L46-L50

Security

No material findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests now lock in the auto-floor behavior by asserting that simulate_sample_size() should return 20 in the “registry floor already sufficient” case, and there is no regression coverage for the new duplicate-core-kwarg collision path on registered estimators. That means the remaining correctness issues will look intentional to the suite. Concrete fix: update the auto-floor expectation once simulate_sample_size() semantics are corrected, and add explicit tests that conflicting registry-path overrides such as data_generator_kwargs={"treatment_effect": ...} raise instead of silently changing the simulated DGP. Location: tests/test_power.py:L1762-L1774, diff_diff/power.py:L1554-L1564

Path to Approval

  1. Prevent silent core-DGP overrides on the registry path. Reject conflicting data_generator_kwargs that would replace builder-supplied simulation knobs or search variables, and add regression tests covering simulate_power(), simulate_mde(), and simulate_sample_size().
  2. Make simulate_sample_size() consistent with its contract: either search below the registry floor until the true minimum is identified, or explicitly change the API/result/docs to return a documented upper bound/search floor instead of required_n.
  3. Update the tests to match the corrected sample-size semantics and to cover the duplicate-kwarg regression path.
